Systems and methods for improved processing of a data file

ABSTRACT

A data processing method includes receiving a data file and a layout file. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The method also includes receiving an input including a search term and a search category, determining, based on the layout file, a search location for which an entry category matches the search category, identifying data sets in the data file having an entry that resides at the determined search location and that match the search term, dividing the data file into a plurality of output files based on the identified data sets, and parallelly processing the plurality of output files in a parallel processing system.

BACKGROUND OF THE DISCLOSURE

The field of the disclosure relates generally to systems and methods for improved processing of a data file, and more specifically to systems and methods for improving the processing of the data file by automatically dividing the data file into manageable sections.

Most merchants keep records of their sales transactions. These records are valuable sources of information for marketing research. For example, these records can be analyzed for information that can be used to improve the merchant's sales. While such records can still be kept in paper format, the general trend nowadays is to tabulate the records in data files with file formats such as, for example, a MICROSOFT EXCEL® spreadsheet (Microsoft Excel is a registered trademark of Microsoft Corporation, Redmond, Wash.). Compared to their paper counterparts, such data files are also easier to analyze because of the many functions that a computer can provide, from basic copying and pasting to more advanced tools that allow for table constructions, calculations, comparisons and many more. Further, these data files can be provided to other companies or third parties to conduct further research.

On the other hand, these data files can be very large, possibly gigabytes in size, because of the sheer number of transactions that can be recorded in a single data file. As a result, a data file can become very difficult to work with, and processing it can become slow and tedious. A need therefore exists to provide methods and/or systems to address the above problem.

BRIEF DESCRIPTION OF THE DISCLOSURE

According to a first aspect, a method for improving the processing of a data file by automatically dividing the data file into manageable sections is provided. The method is implemented by a computer device comprising at least one processor in communication with at least one memory. The method includes receiving, by the computer device, a data file and a layout file. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The method also includes receiving, by the computer device, an input including a first search term and a first search category, determining, by the computer device based on the layout file, a first search location for which an entry category matches the first search category, identifying, by the computer device, data sets in the data file having an entry that resides at the determined first search location and that match the first search term, and generating, by the computer device, an output file including the identified data sets.

According to a second aspect, a data processing method is provided. The method is implemented by at least a first processor and a second processor in communication with at least one memory. The method includes receiving, by the first processor, a data file and a layout file. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The method also includes receiving, by the first processor, an input including a search term and a search category, determining, by the first processor based on the layout file, a search location for which an entry category matches the search category, identifying, by the first processor, data sets in the data file having an entry that resides at the determined search location and that match the search term, dividing, by at least the first processor, the data file into a plurality of output files based on the identified data sets, and parallelly processing, by the first processor and the second processor, the plurality of output files in a parallel processing system.

According to a third aspect, at least one non-transitory computer readable storage media having computer-executable instructions embodied thereon is provided. When executed by at least one processor, the computer-executable instructions cause the processor to receive a data file and a layout file. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The computer-executable instructions also cause the processor to receive an input including a search term and a search category, determine, based on the layout file, a search location for which an entry category matches the search category, identify data sets in the data file having an entry that resides at the determined search location and that match the search term, and generate an output file comprising the identified data sets.

According to a fourth aspect, a computer device for improving the processing of a data file by automatically dividing the data file into manageable sections is provided. The computer device includes a receiver configured to receive a data file, a layout file, and an input. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The input includes a first search term and a first search category. The computer device also includes a determination circuit configured to determine, based on the layout file, a first search location for which an entry category matches the first search category. The computer device further includes an identification circuit configured to identify data sets in the data file having an entry that resides at the determined first search location and that match the first search term. Moreover, the computer device includes a generator configured to generate an output file comprising the identified data sets.

According to a fifth aspect, at least one non-transitory computer readable storage media having computer-executable instructions embodied thereon is provided. When executed by a parallel processing system including at least a first processor and a second processor in communication with at least one memory, the computer-executable instructions cause at least one of the first processor and the second processor to receive, by the first processor, a data file and a layout file. The data file includes a plurality of data sets including a plurality of entries. Each entry resides at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries. The computer-executable instructions also cause at least one of the first processor and the second processor to receive, by the first processor, an input including a search term and a search category, determine, by the first processor based on the layout file, a search location for which an entry category matches the search category, identify, by the first processor, data sets in the data file having an entry that resides at the determined search location and that match the search term, divide, by at least the first processor, the data file into a plurality of output files based on the identified data sets, and parallelly process, by the first processor and the second processor, the plurality of output files.

According to a sixth aspect, a data processing method is provided. The method is implemented by a computer device comprising at least one processor in communication with at least one memory. The method includes receiving a data file, splitting the data file into a plurality of country level files, splitting each country level file into a plurality of zone files, splitting each zone file into a plurality of sub-zone files, parallelly processing the plurality of sub-zone files in a parallel processing system, and recombining the plurality of parallelly processed sub-zone files to form respective processed zone files.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments and implementations are provided by way of example only, and will be better understood and readily apparent to one of ordinary skill in the art from the following written description, read in conjunction with the drawings, in which:

FIG. 1 illustrates a flow diagram of a process for splitting a data file according to various embodiments of the present disclosure;

FIG. 2 illustrates a file splitting device for implementing the process shown in FIG. 1;

FIG. 3 illustrates an information flow of the process shown in FIG. 1;

FIG. 4A illustrates an example of a data file according to various embodiments of the present disclosure;

FIG. 4B illustrates an example of two output files according to various embodiments of the present disclosure;

FIG. 4C illustrates an example of a layout file for splitting a file according to various embodiments of the present disclosure;

FIG. 5A illustrates a flow diagram for processing a data file according to various embodiments of the present disclosure;

FIG. 5B illustrates a flow diagram of a method for processing a data file according to various embodiments of the present disclosure; and

FIG. 6 depicts an exemplary computing device according to various embodiments of the present disclosure.

DETAILED DESCRIPTION OF THE DISCLOSURE

Overview

Various embodiments provide devices and methods for improving the processing of the data file by automatically dividing or splitting the data file into manageable sections. For the purposes of this application, the terms dividing and splitting are interchangeable.

When merchants record their transactions in computer readable data files, these data files tend to become very large due to the large number of transactions recorded. As a result, the file becomes difficult to open in editor applications such as, for example, MICROSOFT EXCEL®. Consequently, any form of data analysis on the data file will be slow and tedious. However, according to various embodiments, the data file can be split or divided into smaller files before the data is processed or analyzed. With the smaller file size of the split files, the data becomes easier to work on.

According to various embodiments, devices and methods are provided with which a data file can be split into smaller files. In an example, a file splitter is configured to receive a data file and a corresponding layout file. The layout file provides information that the file splitter can use to interpret the received data file. This is because organization of transaction records in data files may differ from merchant to merchant. For example, a transaction record in a data file may be represented by a data set, wherein the data set can include a number of entries that provide information regarding the transaction record. These entries are located at their respective entry categories. Correspondingly, information indicating locations of these entry categories is provided by the layout file.

The file splitter is also configured to receive an input including a search term and a search category. The file splitter determines, based on the layout file, a search location for which an entry category matches the received search category. Based on the determined search location, data sets having an entry that resides at the determined search location and matches the received search term are identified. An output file that includes these identified data sets is then generated. It will be understood that the input can include one or more search terms and search categories.

According to various embodiments, a plurality of output files may be generated by the file splitter based on a plurality of inputs. This plurality of output files can be generated such that each output file includes approximately a same number of data sets, so that they can be processed in parallel by a parallel processing system more efficiently.

Advantageously, various embodiments can provide a big boost in terms of time savings and greater efficiency when handling large data files.

Advantageously, with the devices and methods according to various embodiments, data in large computer readable files can be split according to preferred search terms.

TECHNICAL EFFECT

At least one of the technical problems addressed by this system may include: (i) improving speed and efficiency of processing of large data files; (ii) improving the ability to apply parallel processing to analyzing data files; and/or (iii) reducing the load on the computer when processing large data files.

The technical effect achieved by this system may be at least one of: (i) dividing large data files into smaller files based on user criteria; (ii) automated balancing of divided large data files to provide balanced processing to parallel processors; (iii) automated division or splitting of large data files into manageable sections; (iv) improved speed of parallel processing of large data files; and/or (v) reintegrating divided files after processing.

The methods and systems described herein may be implemented using computer programming or engineering techniques including computer software, firmware, hardware, or any combination or subset thereof, wherein the technical effects may be achieved by performing at least one of the following steps: (a) receiving a data file and a layout file, wherein the data file includes a plurality of data sets including a plurality of entries, wherein each entry resides at a respective predetermined location in the data file, wherein the layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries, and wherein the predetermined locations in the layout file include a starting position and data length for each entry category; (b) receiving an input including a first search term, a first search category, a second search term, and a second search category, wherein the first search category comprises one of a country or a state, and wherein the first search term comprises one of a country code or a state code; (c) determining, by the computer device based on the layout file, a first search location for which an entry category matches the first search category; (d) identifying, by the computer device, data sets in the data file having an entry that resides at the determined first search location and that match the first search term; (e) determining a second search location based on the second search category and the layout file; (f) identifying data sets in the data file having a first entry that resides at the determined first search location and that matches the first search term and that include a second entry that resides at the second search location and matches the second search term; (g) extracting, from the data file, the identified data sets to generate the output file; and (h) generating, by the computer device, an output file including the identified data sets.

Additional technical effects may be achieved by performing at least one of the following steps: (a) receiving, by a first processor, a data file and a layout file, wherein the data file includes a plurality of data sets including a plurality of entries, wherein each entry resides at a respective predetermined location in the data file, and wherein the layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries; (b) receiving, by the first processor, an input including a search term and a search category; (c) determining, by the first processor based on the layout file, a search location for which an entry category matches the search category; (d) identifying, by the first processor, data sets in the data file having an entry that resides at the determined search location and that match the search term; (e) dividing, by at least the first processor, the data file into a plurality of output files based on the identified data sets; and (f) parallelly processing, by the first processor and a second processor, the plurality of output files in a parallel processing system, wherein each of the plurality of output files either has a difference in file size not exceeding 20%, has a difference in file size not exceeding one data set, or includes a same number of data sets.

Additional technical effects may be achieved by performing at least one of the following steps: (a) receive a data file and a layout file, wherein the data file includes a plurality of data sets including a plurality of entries, wherein each entry resides at a respective predetermined location in the data file, and wherein the layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries; (b) receive an input including a search term and a search category; (c) determine, based on the layout file, a search location for which an entry category matches the search category; (d) identify data sets in the data file having an entry that resides at the determined search location and that match the search term; and (e) generate an output file comprising the identified data sets.

Additional technical effects may be achieved by performing at least one of the following steps: (a) receiving a data file; (b) splitting the data file into a plurality of country level files; (c) splitting each country level file into a plurality of zone files; (d) splitting each zone file into a plurality of sub-zone files; (e) parallelly processing the plurality of sub-zone files in a parallel processing system; and (f) recombining the plurality of parallelly processed sub-zone files to form respective processed zone files.
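
The country, zone and sub-zone steps can be pictured with a minimal Python sketch. It is illustrative only: the disclosure does not state how zones or sub-zones are delimited, so the sketch assumes each record is a (country, zone, payload) tuple and that a sub-zone file is simply a fixed-size slice of a zone file. The names `group_by`, `split_process_recombine` and `uppercase_payload` are hypothetical, and a thread pool stands in for the parallel processing system.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by(records, key_index):
    """Split records into groups keyed by the field at key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return dict(groups)


def split_process_recombine(records, sub_zone_size, process):
    """Split by country, then zone, then sub-zone; process; recombine per zone."""
    processed_zones = {}
    with ThreadPoolExecutor() as pool:
        for country, country_file in group_by(records, 0).items():
            for zone, zone_file in group_by(country_file, 1).items():
                # Assumption: a sub-zone file is a fixed-size slice of a zone file.
                sub_zones = [zone_file[i:i + sub_zone_size]
                             for i in range(0, len(zone_file), sub_zone_size)]
                # The sub-zone files of one zone are processed concurrently.
                results = pool.map(process, sub_zones)
                # Recombine the processed sub-zone files in their original order.
                processed_zones[(country, zone)] = [
                    row for sub_zone in results for row in sub_zone]
    return processed_zones


def uppercase_payload(sub_zone):
    """A stand-in processing task applied to one sub-zone file."""
    return [(country, zone, payload.upper())
            for country, zone, payload in sub_zone]


records = [("CAN", "ON", "a"), ("CAN", "ON", "b"), ("CAN", "BC", "c"),
           ("GBR", "LN", "d"), ("CAN", "ON", "e")]
zones = split_process_recombine(records, 2, uppercase_payload)
print(zones[("CAN", "ON")])
```

Because `pool.map` returns results in submission order, each recombined zone file preserves the record order of the original zone file.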

Terms Description (in Addition to Plain and Dictionary Meaning of Terms)

A data file is a computer readable file containing information that is organized according to various entry categories. Such information can be records of transactions that take place at a merchant outlet. An example of a data file is shown in FIG. 4A.

A data set is a transaction record shown in the data file. The data set is made up of a combination of entries. For example, a transaction record in a data file may be represented by a data set, wherein the data set can include a number of entries that provide information regarding the transaction record. These entries are located at their respective entry categories. In some embodiments, the data file includes a plurality of data sets, where each data set represents a different transaction record, such as a payment transaction.

An entry category is a word, text, phrase or value that defines the type of entry that is located along a position of the entry category. Correspondingly, an entry is a word, text, phrase or value that resides along a position of a representative entry category. For example, the entry category can be ‘state code’ and corresponding entries can be ‘NY’ for New York, ‘LA’ for Louisiana, ‘MN’ for Minnesota and any other state code. Referring to FIG. 4A, the entries as indicated in reference portion 402 are ‘CAN’ for Canada and ‘GBR’ for Great Britain. Accordingly, the entry category for these entries is ‘country code’. It is understood that there can be many other types of entry categories and corresponding entries that can be used to provide information of a transaction record represented by a data set and organize information in the data file.

A layout file is a computer readable file containing information of how a corresponding data file is organized. In various embodiments, it provides information such as a starting position of an entry category and a data length of the entry category. An example of a layout file 414 is shown in FIG. 4C.

A search term is a word or phrase that is used to identify data sets for which an entry residing along a location of a search category matches the search term.

A search category is a word or phrase that is used to look up, in the layout file, the position along which a search term resides in a data file. For example, a search term can be the country code ‘SG’ for Singapore and the corresponding search category is ‘country code’.

A pre-processor in accordance with the present embodiment is a processor which pre-processes the data files to divide the data sets into categories in response to data in the data files in an entry category at a position determined in accordance with the layout file. The terms ‘pre-processor’, ‘file splitter’ and ‘file splitting device’ may be used interchangeably.

A parallel processing system is a system that is capable of performing multiple computing tasks at any given time. For example, instead of performing ten tasks one at a time, a parallel processing system can perform all ten tasks at the same time. This is typically made possible by utilizing, for example, ten processing nodes whereby each processing node handles one of the ten tasks. Such processing nodes can be virtual processing nodes or physical processing servers that are connected to a main processor or controller to realize a parallel processing system. An advantage of utilizing parallel processing systems is that processing time is greatly reduced compared to conventional processing systems.
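
As a rough illustration of handing one task to each processing node, the following Python sketch distributes output files across a pool of workers. It is a sketch under stated assumptions: a thread pool from the standard library stands in for the processing nodes, and `count_data_sets` and `process_in_parallel` are hypothetical names. For CPU-bound work a process pool or separate servers would be the closer analogue.

```python
from concurrent.futures import ThreadPoolExecutor


def count_data_sets(output_file_text):
    """A stand-in processing task: count the data sets in one output file."""
    return len(output_file_text.splitlines())


def process_in_parallel(output_files, num_nodes):
    """Hand each output file to a worker; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        return list(pool.map(count_data_sets, output_files))


# Three small output files, one data set per line (made-up contents).
output_files = ["a\nb\nc", "d\ne\nf", "g\nh"]
print(process_in_parallel(output_files, num_nodes=3))  # [3, 3, 2]
```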

In an embodiment, a processor unit includes a pre-processor. It may also include a parallel processing system. The pre-processor or multiple pre-processors do not need to be co-located with the parallel processing system or any node of the parallel processing system.

Exemplary Embodiments

Embodiments will be described, by way of example only, with reference to the drawings. Like reference numerals and characters in the drawings refer to like elements or equivalents.

FIG. 1 illustrates a flow diagram of a process 100 for splitting a data file according to various embodiments of the present disclosure. In step 102, a data file and a layout file are received. The data file includes a plurality of data sets, where each data set may represent an individual transaction record. Each data set includes one or more entries, where each entry resides at a respective predetermined location in the data file. The entries may represent different entry categories of the transaction record, such as, but not limited to, merchant identifier, merchant name, merchant address, merchant city, merchant state, merchant postal code, and county code. The layout file includes information indicating the respective predetermined locations or positions in the data entry for the entry categories.

In the present embodiment, the information of locations or positions in the layout file includes a starting position and data length for each entry category. An example of the layout file 414 is shown in FIG. 4C, where the starting position for entry category ‘merchant state’ is defined as ‘@106’ and the corresponding data length is defined as ‘$3’.
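
The starting-position and data-length convention can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each line of a layout file has the form `@<start> <category> $<length>` with 1-based starting positions; the actual syntax of layout file 414 appears only in FIG. 4C, so the line format, the function names `parse_layout` and `extract_entry`, and the sample record are all illustrative assumptions.

```python
import re

# Hypothetical layout line format: "@<start> <category> $<length>", 1-based start.
LAYOUT_LINE = re.compile(r"@(\d+)\s+(.+?)\s+\$(\d+)\s*$")


def parse_layout(layout_text):
    """Map each entry category to its (starting position, data length)."""
    layout = {}
    for line in layout_text.splitlines():
        match = LAYOUT_LINE.match(line.strip())
        if match:
            start, category, length = match.groups()
            layout[category.strip().lower()] = (int(start), int(length))
    return layout


def extract_entry(data_set, layout, category):
    """Return the entry of one data set (one record line) for a category."""
    start, length = layout[category.lower()]
    # Convert the 1-based starting position to a 0-based slice.
    return data_set[start - 1:start - 1 + length].strip()


layout = parse_layout("@1 merchant id $10\n@106 merchant state $3")
# A made-up record: merchant id in columns 1-10, state in columns 106-108.
record = "M000000001".ljust(105) + "NY "
print(extract_entry(record, layout, "merchant state"))  # NY
```

With this representation, looking up a search category in the layout reduces to a dictionary lookup that yields the slice of each data set to compare against the search term.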

In step 104, an input including a search term and a search category is received. In the present embodiment, the search category includes one of a country or a state. For example, the data file may be required to be split into smaller files, each smaller file including data sets (or transaction records) for a different country. For example, one smaller file may contain only transaction records for payment transactions that occurred in Canada, while another smaller file may contain only transaction records for payment transactions that occurred in the United States. Therefore, the search category is specified as ‘country’ so that data sets are identified in accordance with countries. Likewise, the search category may be specified as ‘state’ if the data file has to be split according to different states. It will be understood that the search category including one of a country or a state is usable only if the data file has a layout file containing location information for entry categories corresponding to ‘country’ and ‘state’. In other embodiments, other categories may be used to divide or split the data file, such as, but not limited to, city, postal code, merchant category, merchant identifier, and/or any other identifier.

In the present embodiment, the search term includes a value for one of a country code or a state code. Building on the example provided above, the search term used for splitting the data file into smaller files including data sets for different countries should correspond to the entry used to represent different countries in the data file. Referring to FIG. 4A illustrating an example of a data file including data sets having entries, with values such as ‘CAN’ and ‘GBR’, that reside along a location of an entry category that matches the above-mentioned search category (which in this case is ‘country’), the search term required for generating a file including data sets associated with Canada may be ‘CAN’. Likewise, the search term required for generating a file including data sets associated with Great Britain may be ‘GBR’. It will be understood that the search term including one of a country code or a state code is usable only if the data file has entries corresponding to ‘country code’ or ‘state code’, where such entries reside at a location or position of an entry category that matches the search category mentioned above. In some embodiments, the input further includes an additional search term and an additional search category.

In step 106, a search location for which an entry category matches the search category is determined based on the layout file. In the present embodiment, the process 100 further includes determining an additional search location based on the additional search category and the layout file.

In step 108, data sets are identified in the data file, for which data sets include an entry that resides at the determined search location and matches the search term. In the present embodiment, the process 100 further includes identifying data sets in the data file, for which data sets include the entry that resides at the determined search location and matches the search term, and for which data sets include an entry that resides at the additional search location and matches the additional search term. In some embodiments, the process 100 further includes identifying data sets that include an entry that resides at the determined search location and matches the search term and that include an entry that resides at the additional search location and matches the additional search term.

In step 110, an output file including the identified data sets is generated. In the present embodiment, the process 100 further includes extracting, from the data file, the identified data sets to generate the output file.
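
Steps 102 to 110 can be traced end to end in a short Python sketch. It is a simplified illustration, not the claimed implementation: the layout is assumed to be already parsed into a dictionary of (starting position, data length) pairs with 1-based positions, each data set is assumed to be one fixed-width line, and the function names and sample records are hypothetical. An additional search category and term is handled by requiring every pair to match.

```python
def find_search_location(layout, search_category):
    """Step 106: look up (start, length) for the matching entry category."""
    return layout[search_category]


def identify_data_sets(data_sets, layout, criteria):
    """Step 108: keep data sets whose entries match every (category, term) pair."""
    locations = [(find_search_location(layout, category), term)
                 for category, term in criteria]
    identified = []
    for data_set in data_sets:
        if all(data_set[start - 1:start - 1 + length].strip() == term
               for (start, length), term in locations):
            identified.append(data_set)
    return identified


def generate_output(data_sets, layout, criteria):
    """Step 110: build the output file contents from the identified data sets."""
    return "\n".join(identify_data_sets(data_sets, layout, criteria))


# Step 102: a data file (as lines) and a layout with made-up positions.
layout = {"country code": (1, 3), "state code": (4, 2)}
data_file = ["CANON0001", "GBRLN0002", "CANBC0003", "CANON0004"]

# Step 104: one input with a single pair, and one with an additional pair.
canada = generate_output(data_file, layout, [("country code", "CAN")])
ontario = generate_output(
    data_file, layout, [("country code", "CAN"), ("state code", "ON")])
print(canada.splitlines())   # ['CANON0001', 'CANBC0003', 'CANON0004']
print(ontario.splitlines())  # ['CANON0001', 'CANON0004']
```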

In an alternative embodiment, the process 100 further includes generating a plurality of output files, wherein the output files differ from one another in file size by no more than 20%. Advantageously, such a requirement is one way to make parallel processing of the split files more efficient, as will be further explained in the descriptions below.

In an embodiment, a difference in the number of data sets in each of the plurality of output files does not exceed one. Advantageously, such a requirement is a way to make parallel processing of the split files more efficient, as will be further explained in the descriptions below.

In an embodiment, each of the plurality of output files includes a same number of data sets. Advantageously, such a requirement is a way to make parallel processing of the split files more efficient, as will be further explained in the descriptions below.
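
One simple way to satisfy the requirement that the number of data sets per output file differs by at most one is to distribute the remainder of an integer division across the first few files. The Python sketch below shows this under the assumption that data sets may be assigned to output files purely by position; `split_balanced` is a hypothetical helper, not a name used in the disclosure.

```python
def split_balanced(data_sets, num_files):
    """Split data sets into num_files lists whose sizes differ by at most one."""
    base, extra = divmod(len(data_sets), num_files)
    chunks, start = [], 0
    for index in range(num_files):
        # The first `extra` files each take one additional data set.
        size = base + (1 if index < extra else 0)
        chunks.append(data_sets[start:start + size])
        start += size
    return chunks


chunks = split_balanced([f"record-{i}" for i in range(10)], 3)
print([len(chunk) for chunk in chunks])  # [4, 3, 3]
```

When the number of data sets is an exact multiple of the number of files, every output file receives the same number of data sets.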

FIG. 2 illustrates a file splitting device 202 for implementing the process 100 shown in FIG. 1. The file splitting device 202 includes a receiver 204 (in other words: a receiver circuit) configured to receive a data file, a layout file and an input. The data file includes one or more data sets, each data set including one or more entries, each entry residing at a respective predetermined location in the data file. The layout file includes information indicating the respective predetermined locations for one or more entry categories. The input includes a search term and a search category. In the present embodiment, the search category includes one of a country or a state. In the present embodiment, the search term includes one of a country code or a state code. In the present embodiment, the information of locations (or positions) in the layout file includes a starting position and data length for each entry category. In the present embodiment, the input further includes an additional search term and an additional search category.

The file splitting device 202 further includes a determination circuit 206 configured to determine, based on the layout file, a search location for which an entry category matches the search category. In the present embodiment, the determination circuit 206 is further configured to determine an additional search location based on the additional search category and the layout file.

The file splitting device 202 further includes an identification circuit 208 configured to identify data sets in the data file, for which data sets include an entry that resides at the determined search location and matches the search term. In the present embodiment, the identification circuit 208 is further configured to identify data sets in the data file, for which data sets include an entry that resides at the determined search location and matches the search term and for which data sets include an entry that resides at the additional search location and matches the additional search term.

The file splitting device 202 further includes a generator 210 configured to generate an output file including the identified data sets. In the present embodiment, the file splitting device 202 is further configured to extract, from the data file, the identified data sets to generate the output file. In an embodiment, the generator 210 is further configured to generate a plurality of output files, wherein the output files differ from one another in file size by no more than 20%. In an embodiment, a difference in number of data sets in each of the plurality of output files does not exceed one. In an embodiment, each of the plurality of output files includes a same number of data sets.

In the present embodiment, a (for example non-transitory) computer readable medium is provided which includes instructions which, when executed by a processor, make the processor perform a file splitting method (for example process 100 shown in FIG. 1).

FIG. 3 illustrates an information flow 300 of the process 100 shown in FIG. 1. In the present embodiment, the information flow 300 is between an input device 302, a file splitter 304, and an output device 306. It will be understood by those skilled in the art that the various devices can be facilitated by various entities and that various devices may be combined into one device.

According to various embodiments, as illustrated in FIG. 3, devices and methods are provided to split (or divide) a data file.

A data file and a layout file are transmitted 310 to the file splitter 304 from an input device 302. The input device 302 may be a server associated with a merchant for storing the data file and layout file. The channel for sending the files to the file splitter 304 can be, for example, a network such as the internet, a local area network (LAN), a virtual private network (VPN) and other similar networks. In an embodiment, the input device 302 may be a mobile device such as a portable hard disk, smartphone, universal serial bus (USB) drive or other similar device having stored thereon the data file and layout file, such that the files are transmitted 310 to the file splitter 304 by plugging the input device 302 into, for example, an input port of the file splitter 304. An example of the file splitter 304 is the file splitting device 202 of FIG. 2. The file splitter 304 may also be referred to as a pre-processor.

The file splitter 304, having received the data file and layout file, also receives 314 an input 308. The input 308 may be received 314 by the file splitter 304 in the form of a programming script indicating a search category and a search term on which the file splitting is to be based. The input 308 may also be received 314 by the file splitter 304 through interaction with an input interface such as a keyboard, mouse, touch screen and other similar interface, such that an input 308 including a search category and a search term is sent to the file splitter 304. In an embodiment, an input 308 may include more than one search category and more than one search term. Further, a plurality of inputs 308 may be received by the file splitter 304. For example, M inputs are shown in input 308, such that Input 1 includes N search categories and N corresponding search terms; Input 2 includes P search categories and P corresponding search terms; and Input M includes Q search categories and Q corresponding search terms.

With the received data file, layout file, and input 308, the file splitter 304 proceeds to perform the next step of determining, based on the layout file, a search location for which an entry category matches the search category. For example, the search category may be ‘country code’. Referring to an example of a layout file 414 as illustrated in FIG. 4C, the file splitter 304 can determine that the search category corresponds to an entry category that begins at a starting position ‘@119’ and spans a data length of ‘$3’. With this information, the file splitter 304 can now perform the next step of identifying data sets in the data file, for which data sets include an entry that resides at the determined search location and matches the search term. Referring to an example of a data file 400 illustrated in FIG. 4A, the determined location of the required entry category (@119 to @121) corresponds to the portion 402 of the data file 400. The portion 402 includes two different entries ‘CAN’ (for Canada) and ‘GBR’ (for Great Britain). If, for example, the search term is ‘CAN’, data sets containing the entry ‘CAN’ will be identified by the file splitter.
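As a non-limiting sketch, the lookup of the search location in the layout file and the identification of matching data sets might be expressed as follows. The layout line format (`@start name $length`), the category name `country_code`, and the function names are assumptions made for illustration; the exact contents of layout file 414 in FIG. 4C are not reproduced here. The starting position is treated as 1-based, consistent with the range @119 to @121 for a data length of 3.

```python
def parse_layout(layout_text):
    """Map each entry category to (start, length); start is 1-based.

    Assumed (hypothetical) line format: "@119 country_code $3".
    """
    layout = {}
    for line in layout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        start_tok, name, length_tok = line.split()
        layout[name] = (int(start_tok.lstrip("@")), int(length_tok.lstrip("$")))
    return layout


def find_matching(data_sets, layout, search_category, search_term):
    """Return the data sets whose entry at the search location equals the term."""
    start, length = layout[search_category]
    lo = start - 1  # convert the 1-based position to a 0-based slice index
    return [d for d in data_sets if d[lo:lo + length] == search_term]


layout = parse_layout("@1 merchant_id $8\n@119 country_code $3\n")

# Hypothetical fixed-width data sets: 118 filler characters, then the country code.
data_sets = ["A" * 118 + "CAN", "B" * 118 + "GBR", "C" * 118 + "CAN"]
canada = find_matching(data_sets, layout, "country_code", "CAN")
```

Here `canada` holds the first and third data sets, since only those carry ‘CAN’ at positions 119 to 121.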

After identifying the required data sets, the file splitter 304 generates an output file including the identified data sets. For example, two inputs may be received 314 by the file splitter 304 to split the data file as shown in FIG. 4A. The first input includes a search term ‘CAN’ and search category ‘country code’. The second input includes a search term ‘GBR’ and search category ‘country code’. The file splitter, having determined the location of the entry category that matches the search category at portion 402, identifies data sets whose entries residing along portion 402 match the received search terms ‘CAN’ and ‘GBR’. The identified data sets are then used for generating two output files (an output file for each input). An example of the generated output files 404 is illustrated in FIG. 4B. The generated output file based on the first input includes data sets with the search term ‘CAN’, as can be seen in portion 406. The generated output file based on the second input includes data sets with the search term ‘GBR’ as shown in portion 408. The file splitter 304 may be further configured to create file names for each generated output file. For example, the file splitter 304 may name the first output file as ‘mts_mia_data_CAN.dat’ as indicated in portion 410 and the file name for the second output file may be ‘mts_mia_data_GBR.dat’ as indicated in portion 412. Accordingly, if inputs 1 to M of FIG. 3 are received 314 by the file splitter 304, corresponding output files 1 to M will be generated.
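The generation of one output file per input, with file names of the form shown in portions 410 and 412, might be sketched as follows. This is a minimal in-memory illustration: the function name `split_by_terms` is hypothetical, the output "files" are represented as lists keyed by file name rather than written to disk, and the sample data sets are invented.

```python
def split_by_terms(data_sets, start, length, search_terms, prefix="mts_mia_data"):
    """Group data sets by the entry at a 1-based location, one group per term.

    Returns a dict mapping an output file name to its list of data sets.
    Data sets matching none of the search terms are left out.
    """
    lo = start - 1
    outputs = {term: [] for term in search_terms}
    for d in data_sets:
        entry = d[lo:lo + length]
        if entry in outputs:
            outputs[entry].append(d)
    return {f"{prefix}_{term}.dat": sets for term, sets in outputs.items()}


# Hypothetical data sets with the country code at positions 119 to 121.
data_sets = [
    "A" * 118 + "CAN",
    "B" * 118 + "GBR",
    "C" * 118 + "CAN",
    "D" * 118 + "USA",
]
output_files = split_by_terms(data_sets, start=119, length=3,
                              search_terms=["CAN", "GBR"])
```

With the two inputs ‘CAN’ and ‘GBR’, the sketch yields two entries, keyed ‘mts_mia_data_CAN.dat’ and ‘mts_mia_data_GBR.dat’; the ‘USA’ data set is not placed in either because no input requested it.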

The generated output files may then be forwarded 312 to an output device 306 for further processing. Such processing may be, for example, statistical analysis of the data sets in each output file such as calculating an average transaction amount, analysis of transaction frequencies during different times of the day or month or year, differences in transactions made in certain countries or states and other types of processing. Advantageously, the file splitter can generate output files including data sets with specific, user-defined search terms, such that output files can be generated based on the type of data processing and analysis required.

It will be understood that the output device 306 may be a server, processor, mobile device, USB storage device or other similar device that can be used to store or process the generated output file. In an embodiment, the output device 306 is a parallel processor that enables faster processing for a plurality of output files, for instance a processor that utilizes a Hadoop architecture. For example, the plurality of output files 1 to M can each be processed by a node of a parallel processing system, such that processing time is significantly shortened.

Such processing time under a parallel processing system can be further shortened by ensuring that the generated output files are of approximately the same file size, contain an approximately equal number of data sets, or by other similar methods. For example, if one of the plurality of output files 1 to M includes twenty data sets while the rest have only ten data sets each, the output file with twenty data sets will cause a bottleneck in processing time since a parallel processing system will take approximately twice the time to process the output file with twenty data sets as compared to the other output files. Therefore, it is ideal to ensure that each output file includes about the same number of data sets, such that a parallel processing system can complete processing of all the output files at approximately the same time.
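The bottleneck can be illustrated with a simple model, under the stated assumptions that each output file is processed by its own node and that processing time is proportional to the number of data sets in the file. Under those assumptions the overall time is set by the largest file. The function name and the unit of time are hypothetical.

```python
def parallel_time(data_set_counts, time_per_data_set=1.0):
    """Overall time when each file runs on its own node: the slowest node decides."""
    return max(data_set_counts) * time_per_data_set


# One file with twenty data sets, three files with ten each (fifty in total).
unbalanced = parallel_time([20, 10, 10, 10])

# The same fifty data sets spread so that file sizes differ by at most one.
balanced = parallel_time([13, 13, 12, 12])
```

In this model the unbalanced split takes 20 time units while the balanced split of the same fifty data sets takes 13, even though the total amount of work is unchanged.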

In an embodiment, the file splitter 304 may form part of a processor unit that is configured to split data files into a plurality of smaller files. The processor unit may further include a parallel processing system to parallelly process the plurality of smaller files.

FIG. 5A illustrates a flow diagram 500 for processing a data file 502 according to various embodiments of the present disclosure. The data file 502 may first be split on a country level, in accordance with the process 100 illustrated in FIG. 1. For example, a plurality of inputs 308 can be received by the file splitter 304 wherein each input includes a country code as the search term and ‘country code’ as the corresponding search category, such that each input includes a different country code as the search term. It will be understood that the choice of country codes depends on the type of data required for analysis. Based on a corresponding layout file of data file 502, a search location for which an entry category in the layout file matches the search category ‘country code’ is determined for each of the plurality of inputs. Data sets are then identified in the data file 502, for which data sets include an entry that resides at the determined search location and matches the required search term. The identified data sets can then be used for generating output files. For example, the file splitter 304, in accordance with a first input including a search term ‘country code 1’ (such as, for example, USA), can generate an output file 504 that includes identified data sets containing an entry that matches the search term of ‘USA’. Correspondingly, output files 506 to 508 can similarly be generated based on the plurality of inputs as mentioned above. It will be understood that 506 represents a plurality of output files that are generated between output files 504 and 508.

It may not be adequate for the file splitter to only split the data file on a country level. This is because larger countries such as USA or China tend to have significantly greater numbers of transactions recorded in such merchant data files as compared to the other countries. As a result, the generated output file including data sets for such large countries may still be too big in file size to work on efficiently. Therefore, a possible solution is to further split such files into zones, i.e. on a zone level as shown in FIG. 5A.

Starting with the generated output file 504 including data sets for search term ‘USA’, the output file 504 may be further split by the file splitter 304 into a plurality of zone files 514, 516, 518. It will be understood that 516 represents a plurality of zone files that are generated between zone files 514 and 518. Each zone file may include a combination of data sets corresponding to states of the USA. For example, zone 1 file 514 may include data sets corresponding to transactions occurring in Florida, Minnesota and Louisiana. An input including search terms ‘FL’, ‘MN’, and ‘LA’ (corresponding to state codes for Florida, Minnesota, and Louisiana respectively) and corresponding search category ‘state code’ may be received by the file splitter 304 for generating zone 1 file 514. As another example, zone M file 518 may include data sets corresponding to transactions occurring in other US states such as Washington D.C., New York, and Texas. Therefore, an input including search terms ‘DC’, ‘NY’, and ‘TX’ (corresponding to state codes for Washington D.C., New York, and Texas respectively) and corresponding search category ‘state code’ may be received by the file splitter 304 for generating zone M file 518. The generation of these zone files 514, 516, and 518 may be based on country code 1 file 504 or the data file 502, such that the data sets for which entries match the respective search terms are identified from either the country code 1 file 504 or the data file 502.
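A zone-level split in which each zone is defined by several state codes might be sketched as follows. The function name `split_into_zones`, the zone labels, the eight-character data sets, and the position of the state code are assumptions for illustration only; in practice the location would be determined from the layout file.

```python
def split_into_zones(data_sets, start, length, zones):
    """Assign each data set to the zone whose state codes include its entry.

    zones maps a zone name to the state codes (search terms) it covers.
    Data sets whose state code belongs to no zone are left out.
    """
    lo = start - 1  # 1-based starting position to 0-based index
    state_to_zone = {code: zone for zone, codes in zones.items() for code in codes}
    zone_files = {zone: [] for zone in zones}
    for d in data_sets:
        zone = state_to_zone.get(d[lo:lo + length])
        if zone is not None:
            zone_files[zone].append(d)
    return zone_files


zones = {
    "zone_1": ["FL", "MN", "LA"],
    "zone_M": ["DC", "NY", "TX"],
}
# Hypothetical country-level data sets: state code at position 4, length 2.
usa_file = ["USAFL001", "USANY002", "USALA003", "USACA004"]
zone_files = split_into_zones(usa_file, start=4, length=2, zones=zones)
```

The same function could be applied to the original data file instead of the country-level file, since matching depends only on the entry at the state code location.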

Before the zone files 514, 516, and 518 are forwarded to a parallel processor 510 for processing, it is ideal to ensure that the files to be processed are of approximately equal size or include an approximately same number of data sets per file. Therefore, a sub-zone level processing may be introduced whereby each of the zone files 514, 516, and 518 is split into smaller sub-zone files. For example, Zone 1 file 514 may be split into P number of sub-zone files 524, 526, and 528, such that each sub-zone file includes an approximately same number of data sets from Zone 1 file 514. It will be understood that 526 represents a plurality of sub-zone files that are generated between sub-zone files 524 and 528. In an embodiment, the difference in number of data sets among the sub-zone files is at most one. These sub-zone files 524, 526, and 528 are then forwarded to a parallel processor 510 for processing. The processed sub-zone files 534, 536, and 538 may then be merged back to form a processed zone 1 file 540. It will be understood that the remaining zone files 516 and 518 may similarly be split into sub-zone files and then processed by the parallel processor 510.
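One way to obtain sub-zone files whose data set counts differ by at most one is to divide the count by P and distribute the remainder one data set at a time. The following is a minimal sketch under that assumption; the function names are hypothetical and a zone file is represented as a list of data sets.

```python
def split_evenly(data_sets, p):
    """Split into p consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data_sets), p)
    chunks, pos = [], 0
    for i in range(p):
        size = base + 1 if i < extra else base  # first `extra` chunks get one more
        chunks.append(data_sets[pos:pos + size])
        pos += size
    return chunks


def merge(chunks):
    """Concatenate chunks back into a single list, preserving order."""
    return [d for chunk in chunks for d in chunk]


zone_1_file = [f"data set {n}" for n in range(10)]
sub_zone_files = split_evenly(zone_1_file, 3)
sizes = [len(c) for c in sub_zone_files]
```

Ten data sets split three ways give sub-zone files of four, three, and three data sets, and merging the chunks in order restores the original zone file.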

Advantageously, splitting zone files into sub-zone files including an approximately same number of data sets can make parallel processing more efficient by eliminating processing bottlenecks caused by processing files of different sizes in parallel.

FIG. 5B illustrates a flow diagram of a method 501 for processing a data file 502 according to various embodiments of the present disclosure. At step 542, a data file is received. At step 544, the data file is split into a plurality of country level files. At step 546, each country level file is split into a plurality of zone files. At step 548, each zone file is split into a plurality of sub-zone files. At step 550, the plurality of sub-zone files is parallelly processed in a parallel processing system. At step 552, the plurality of parallelly processed sub-zone files is recombined to form respective processed zone files.
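Steps 550 and 552 might be sketched as follows, purely by way of example. A thread pool from the Python standard library stands in for the parallel processing system; it is not the Hadoop architecture mentioned elsewhere in this description. The processing function (summing transaction amounts) and the representation of a data set as a (state code, amount) pair are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sub_zone(data_sets):
    """Hypothetical processing: total transaction amount of one sub-zone file."""
    return sum(amount for _state, amount in data_sets)


def process_zones(sub_zone_files, max_workers=4):
    """Process every sub-zone file in parallel, then regroup results by zone.

    sub_zone_files maps a zone name to its list of sub-zone files.
    """
    jobs = [(zone, chunk)
            for zone, chunks in sub_zone_files.items()
            for chunk in chunks]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the submitted jobs.
        results = list(pool.map(process_sub_zone, [chunk for _, chunk in jobs]))
    processed = {}
    for (zone, _chunk), result in zip(jobs, results):
        processed.setdefault(zone, []).append(result)  # recombine per zone
    return processed


sub_zone_files = {
    "zone_1": [[("FL", 10.0), ("MN", 5.0)], [("LA", 2.5)]],
    "zone_M": [[("NY", 1.0)]],
}
processed_zone_files = process_zones(sub_zone_files)
```

Each sub-zone file is an independent job, so the jobs can run concurrently; the results are then collected under the zone they came from, corresponding to the recombination into processed zone files.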

FIG. 6 depicts an exemplary computing device 600, hereinafter interchangeably referred to as a computer system 600 or as a server 600, where one or more such computing devices 600 may be used to implement the file splitting device 202 shown in FIG. 2 and/or the file splitter 304 shown in FIG. 3. The following description of the computing device 600 is provided by way of example only and is not intended to be limiting.

As shown in FIG. 6, the example computing device 600 includes a processor 604 for executing software routines. Although a single processor is shown for the sake of clarity, the computing device 600 may also include a multi-processor system. The processor 604 is connected to a communication infrastructure 606 for communication with other components of the computing device 600. The communication infrastructure 606 may include, for example, a communications bus, cross-bar, or network.

The computing device 600 further includes a main memory 608, such as a random access memory (RAM), and a secondary memory 610. The secondary memory 610 may include, for example, a storage drive 612, which may be a hard disk drive, a solid state drive or a hybrid drive and/or a removable storage drive 614, which may include a magnetic tape drive, an optical disk drive, a solid state storage drive (such as a USB flash drive, a flash memory device, a solid state drive or a memory card), or the like. The removable storage drive 614 reads from and/or writes to a removable storage medium 644 in a well-known manner. The removable storage medium 644 may include magnetic tape, optical disk, non-volatile memory storage medium, or the like, which is read by and written to by removable storage drive 614. As will be appreciated by persons skilled in the relevant art(s), the removable storage medium 644 includes a computer readable storage medium having stored therein computer executable program code instructions and/or data.

In an alternative implementation, the secondary memory 610 may additionally or alternatively include other similar means for allowing computer programs or other instructions to be loaded into the computing device 600. Such means can include, for example, a removable storage unit 622 and an interface 640. Examples of a removable storage unit 622 and interface 640 include a program cartridge and cartridge interface (such as that found in video game console devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a removable solid state storage drive (such as a USB flash drive, a flash memory device, a solid state drive or a memory card), and other removable storage units 622 and interfaces 640 which allow software and data to be transferred from the removable storage unit 622 to the computer system 600.

The computing device 600 also includes at least one communication interface 624. The communication interface 624 allows software and data to be transferred between computing device 600 and external devices via a communication path 626. In various embodiments of the disclosure, the communication interface 624 permits data to be transferred between the computing device 600 and a data communication network, such as a public data or private data communication network. The communication interface 624 may be used to exchange data between different computing devices 600 where such computing devices 600 form part of an interconnected computer network. Examples of a communication interface 624 can include a modem, a network interface (such as an Ethernet card), a communication port (such as a serial, parallel, printer, GPIB, IEEE 1394, RJ45, and USB), an antenna with associated circuitry and the like. The communication interface 624 may be wired or may be wireless. Software and data transferred via the communication interface 624 are in the form of signals which can be electronic, electromagnetic, optical, or other signals capable of being received by communication interface 624. These signals are provided to the communication interface via the communication path 626.

As shown in FIG. 6, the computing device 600 further includes a display interface 602 which performs operations for rendering images to an associated display 630 and an audio interface 632 for performing operations for playing audio content via associated speaker(s) 634.

As used herein, the term “computer program product” (or computer readable medium, which may be a non-transitory computer readable medium) may refer, in part, to removable storage medium 644, removable storage unit 622, a hard disk installed in storage drive 612, or a carrier wave carrying software over communication path 626 (wireless link or cable) to communication interface 624. Computer readable storage media (or computer readable media) refers to any non-transitory, non-volatile tangible storage medium that provides recorded instructions and/or data to the computing device 600 for execution and/or processing. Examples of such storage media include magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, a solid state storage drive (such as a USB flash drive, a flash memory device, a solid state drive or a memory card), a hybrid drive, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computing device 600. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computing device 600 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

The computer programs (also called computer program code) are stored in main memory 608 and/or secondary memory 610. Computer programs can also be received via the communication interface 624. Such computer programs, when executed, enable the computing device 600 to perform one or more features of embodiments discussed herein. In various embodiments, the computer programs, when executed, enable the processor 604 to perform features of the above-described embodiments. Accordingly, such computer programs represent controllers of the computer system 600.

Software may be stored in a computer program product and loaded into the computing device 600 using the removable storage drive 614, the storage drive 612, or the interface 640. The computer program product may be a non-transitory computer readable medium. Alternatively, the computer program product may be downloaded to the computer system 600 over the communications path 626. The software, when executed by the processor 604, causes the computing device 600 to perform functions of embodiments described herein.

It is to be understood that the embodiment of FIG. 6 is presented merely by way of example. Therefore, in some embodiments one or more features of the computing device 600 may be omitted. Also, in some embodiments, one or more features of the computing device 600 may be combined together. Additionally, in some embodiments, one or more features of the computing device 600 may be split into one or more component parts. The main memory 608 and/or the secondary memory 610 may serve as the memory for the file splitting device 202, file splitter 304 or pre-processor; while the processor 604 may serve as the processor of the file splitting device 202, file splitter 304 or pre-processor.

Some portions of the description herein are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

Unless specifically stated otherwise, and as apparent from the description herein, it will be appreciated that throughout the present specification, discussions utilizing terms such as “receiving”, “splitting”, “identifying”, “scanning”, “calculating”, “determining”, “replacing”, “generating”, “initializing”, “outputting”, or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may include a computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a computer suitable for executing the various methods/processes described herein will appear from the description herein.

In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the disclosure.

Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a computer effectively results in an apparatus that implements the steps of the preferred method.

As will be appreciated based upon the foregoing specification, the above-described embodiments of the disclosure may be implemented using computer programming or engineering techniques including computer software, firmware, hardware or any combination or subset thereof. Any such resulting program, having computer-readable code means, may be embodied or provided within one or more computer-readable media, thereby making a computer program product, i.e., an article of manufacture, according to the discussed embodiments of the disclosure. The computer-readable media may be, for example, but is not limited to, a fixed (hard) drive, diskette, optical disk, magnetic tape, semiconductor memory such as read-only memory (ROM), and/or any transmitting/receiving medium, such as the Internet or other communication network or link. The article of manufacture containing the computer code may be made and/or used by executing the code directly from one medium, by copying the code from one medium to another medium, or by transmitting the code over a network.

These computer programs (also known as programs, software, software applications, “apps”, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The “machine-readable medium” and “computer-readable medium,” however, do not include transitory signals. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

According to various embodiments, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with an alternative embodiment.

As used herein, the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a processor, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The above memory types are examples only, and are thus not limiting as to the types of memory usable for storage of a computer program.

In one embodiment, a computer program is provided, and the program is embodied on a computer readable medium. In an exemplary embodiment, the system is executed on a single computer system, without requiring a connection to a server computer. In a further embodiment, the system is being run in a Windows® environment (Windows is a registered trademark of Microsoft Corporation, Redmond, Wash.). In yet another embodiment, the system is run on a mainframe environment and a UNIX® server environment (UNIX is a registered trademark of X/Open Company Limited located in Reading, Berkshire, United Kingdom). The application is flexible and designed to run in various different environments without compromising any major functionality.

In some embodiments, the system includes multiple components distributed among a plurality of computing devices. One or more components may be in the form of computer-executable instructions embodied in a computer-readable medium. The systems and processes are not limited to the specific embodiments described herein. In addition, components of each system and each process can be practiced independently and separately from other components and processes described herein. Each component and process can also be used in combination with other assembly packages and processes.

As used herein, an element or step recited in the singular and preceded by the word “a” or “an” should be understood as not excluding plural elements or steps, unless such exclusion is explicitly recited. Furthermore, references to “example embodiment” or “one embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features.

The patent claims at the end of this document are not intended to be construed under 35 U.S.C. § 112(f) unless traditional means-plus-function language is expressly recited, such as “means for” or “step for” language being expressly recited in the claim(s).

It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present disclosure as shown in the specific embodiments without departing from the spirit or scope of the disclosure as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

This written description uses examples to disclose the disclosure, including the best mode, and also to enable any person skilled in the art to practice the disclosure, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the disclosure is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.

What is claimed is:
 1. A data processing method, the method implemented by at least a first processor and a second processor in communication with at least one memory, the method comprising: receiving, by the first processor, a data file and a layout file, wherein the data file includes a plurality of data sets including a plurality of entries, wherein each entry resides at a respective predetermined location in the data file, and wherein the layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries; receiving, by the first processor, an input including a search term and a search category; determining, by the first processor based on the layout file, a search location for which an entry category matches the search category; identifying, by the first processor, data sets in the data file having an entry that resides at the determined search location and that match the search term; dividing, by at least the first processor, the data file into a plurality of output files based on the identified data sets; and parallelly processing, by the first processor and the second processor, the plurality of output files in a parallel processing system.
 2. The method in accordance with claim 1, wherein the input further comprises a second search term and a second search category, and wherein the method further comprises: determining a second search location based on the second search category and the layout file; and identifying data sets in the data file having a first entry that resides at the determined first search location and that matches the first search term and that include a second entry that resides at the second search location and matches the second search term.
 3. The method in accordance with claim 1, wherein the first search category comprises one of a country or a state.
 4. The method in accordance with claim 1, wherein the first search term comprises one of a country code or a state code.
 5. The method in accordance with claim 1, further comprising extracting, from the data file, the identified data sets to generate the plurality of output files.
 6. The method in accordance with claim 1, wherein the predetermined locations in the layout file comprise a starting position and a data length for each entry category.
 7. The method in accordance with claim 1, wherein the plurality of output files differ from one another in file size by no more than 20%.
 8. The method in accordance with claim 1, wherein the plurality of output files differ from one another in file size by no more than one data set.
 9. The method in accordance with claim 1, wherein each of the plurality of output files includes a same number of data sets.
 10. A computer device for improving the processing of a data file by automatically dividing the data file into manageable sections, the computer device comprising: a receiver configured to receive a data file, a layout file, and an input, wherein the data file includes a plurality of data sets including a plurality of entries, wherein each entry resides at a respective predetermined location in the data file, wherein the layout file includes information indicating the respective predetermined locations for a plurality of entry categories that correspond to the plurality of entries, and wherein the input includes a first search term and a first search category; a determination circuit configured to determine, based on the layout file, a first search location for which an entry category matches the first search category; an identification circuit configured to identify data sets in the data file having an entry that resides at the determined first search location and that matches the first search term; and a generator configured to generate an output file comprising the identified data sets.
 11. The computer device in accordance with claim 10, wherein the input further comprises a second search term and a second search category, wherein the determination circuit is further configured to determine a second search location based on the second search category and the layout file, and wherein the identification circuit is further configured to identify data sets in the data file having a first entry that resides at the determined first search location and that matches the first search term and that include a second entry that resides at the second search location and matches the second search term.
 12. The computer device in accordance with claim 10, wherein the first search category comprises one of a country or a state.
 13. The computer device in accordance with claim 10, wherein the first search term comprises one of a country code or a state code.
 14. The computer device in accordance with claim 10, wherein the generator is further configured to extract, from the data file, the identified data sets to generate the output file.
 15. The computer device in accordance with claim 10, wherein the predetermined locations in the layout file comprise a starting position and a data length for each entry category.
 16. The computer device in accordance with claim 10, wherein the computer device is configured to generate a plurality of output files, and wherein the computer device further comprises a parallel processing system configured to parallelly process the plurality of output files.
 17. A data processing method, the method implemented by a computer device comprising at least one processor in communication with at least one memory, the method comprising: receiving a data file; splitting the data file into a plurality of country level files; splitting each country level file into a plurality of zone files; splitting each zone file into a plurality of sub-zone files; parallelly processing the plurality of sub-zone files in a parallel processing system; and recombining the plurality of parallelly processed sub-zone files to form respective processed zone files.
 18. The method in accordance with claim 17, wherein the plurality of sub-zone files differ from one another in file size by no more than 20%.
 19. The method in accordance with claim 17, wherein the plurality of sub-zone files differ from one another in file size by no more than one data set.
 20. The method in accordance with claim 17, wherein each of the plurality of sub-zone files includes a same number of data sets.
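
As an informal illustration only, which forms no part of the claims and does not limit them, the layout-driven search, division, and parallel processing recited in claims 1, 6, and 8 might be sketched in Python as follows. The layout values, the fixed-width record format, the function names, and the per-file processing step (summing an amount field) are all hypothetical assumptions, and a thread pool merely stands in for the parallel processing system with its first and second processors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: entry category -> (starting position, data length).
LAYOUT = {"country": (0, 2), "state": (2, 2), "amount": (4, 6)}


def find_matches(records, layout, category, term):
    """Return the data sets whose entry at the category's location equals term."""
    start, length = layout[category]  # search location taken from the layout
    return [r for r in records if r[start:start + length] == term]


def split_evenly(records, parts):
    """Divide records into `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), parts)
    chunks, pos = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(records[pos:pos + size])
        pos += size
    return chunks


def total_amount(chunk, layout):
    """Hypothetical processing step: sum the amount entry of each data set."""
    start, length = layout["amount"]
    return sum(int(r[start:start + length]) for r in chunk)


def process_in_parallel(chunks, layout):
    """Process each chunk concurrently; results come back in chunk order."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(lambda c: total_amount(c, layout), chunks))
```

For example, given records such as "USNY000100" and "CAON000075", calling find_matches with the search category "country" and the search term "US" keeps only the records whose first two characters are "US"; split_evenly then yields the in-memory equivalent of the output files, and process_in_parallel handles them concurrently.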